Audio interactive decomposition editor method and system

ABSTRACT

A distributed system and a corresponding data processing method are disclosed, for decomposing audio signals including mixed audio sources. The system comprises at least one client terminal, a remote queuing module and at least one remote audio data processing module connected in a network. A client terminal stores source audio signal data, selects at least one signal decomposition type, uploads source audio signal data with data representative of the decomposition type selection to the queuing module, and downloads decomposed audio signal data. The queuing module queues uploaded source audio data and distributes same to data processing module(s). The queuing module also queues uploaded decomposed audio signal data and distributes same to client terminal(s). An audio data processing module processes distributed source audio data into decomposed audio signal data according to the type selection, and uploads decomposed audio signal data to the at least one remote queuing module.

The application claims the benefit of U.S. Provisional Patent Application No. 62/949,662, filed 18 Dec. 2019, the specification of which is hereby incorporated herein by reference.

BACKGROUND OF THE INVENTION

Field of the Invention

The present invention relates to a distributed method and system for decomposing digital audio signals incorporating multiple audio sources into audio signals incorporating more discrete audio sources.

Description of Related Art

The combination of computing resources evolving according to Moore's Law, with the development and optimisation of algorithms embodied as image data processing techniques, has seen a revolution in the ability to manipulate digital image and video content, with considerable impact across a broad range of applications and user groups, from hobbyists to media production houses and the movie and broadcast industries. This revolution has generated unprecedented creative opportunities for those working with image and video, enhancing their respective creative workflows through the provision of suites of image and video data processing tools. This is perhaps best evidenced, albeit anecdotally, as the common knowledge, and use, by the non-specialist public of image data editing applications such as Adobe® Photoshop®.

By contrast, advances in the field of audio signal editing have been relatively minor. One of the main limitations of existing audio data editing applications is their restricted capacities as regards decomposing an audio signal that has been mixed from several underlying audio components or sources. While recent audio editors have introduced enhanced manipulation capabilities, such as the ability to edit audio spectrograms, they still do not offer the ability to accurately and effectively decompose a mixed audio signal automatically, in a manner analogous to the decomposition of an image by image data editing software. For example, it is relatively easy for a novice user to extract a person from a background in a photograph and to composite the extracted person into a new background, because the state of the art in image data processing algorithms that underlie such extraction and compositing functions permits substantial automation of these tasks, with limited user input if any. Extracting a pitched instrument, such as a trumpet, from a recording of an orchestral performance still cannot be done in the same manner.

The majority of research into the decomposition of audio signals in a perceptually meaningful manner has taken place in the context of sound source separation (SSS). In any recorded audio signal, sound sources can include any one or more of the human voice, environmental noises such as crowd noise, gunshot noise or car noise, and both pitched and unpitched musical instruments such as guitar, violin, drums and more. The SSS problem can thus be defined as, given a recording containing a mixture of different sound sources, such as a conversation recorded in a busy environment containing multiple speakers, or a recording such as a popular music track with vocals, how to extract one or more individual sources from that mixture.

Audio recordings are typically available as stereo or mono signals, thus there are usually many more sources present than mixtures available. The SSS problem is thus an underdetermined problem, wherein no exact solution is possible. Accordingly, methods for attempting SSS have focused on model-based methods, taking advantage of properties of the signals to be separated, as well as prior knowledge of the sources to be separated.
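For illustration (this formulation is ours, not part of the original text), the stereo case can be written as a linear mixing model:

```latex
% Illustrative stereo mixing model: two observed channels, J > 2 sources.
% x(t): stereo observation, A: 2 x J mixing matrix, s(t): source signals.
x(t) = A\,s(t), \qquad x(t) \in \mathbb{R}^{2}, \quad
A \in \mathbb{R}^{2 \times J}, \quad s(t) \in \mathbb{R}^{J}, \quad J > 2
```

With only two observations per time instant and more than two unknown sources, A has no inverse and infinitely many source combinations reproduce the mixture exactly, which is why the model-based constraints and priors discussed above are needed to select a plausible solution.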

An extensive amount of research has been published on the topic of sound source separation, much of it focusing on techniques based on non-negative matrix factorisation (NMF) and its variants, as well as Bayesian statistical signal processing, for instance by Paris Smaragdis et al, “Static and Dynamic Source Separation Using Nonnegative Factorizations: A unified view”, IEEE Signal Processing Magazine, Volume 31, Issue 3, May 2014, pp 66-75, and by Emmanuel Vincent et al in “From Blind to Guided Audio Source Separation: How models and side information can improve the separation of sound”, IEEE Signal Processing Magazine, Volume 31, Issue 3, May 2014, pp 107-115. Such approaches perform a parts-based decomposition on the audio signal, wherein the parts typically correspond to notes or chords played by an instrument, or to drums. The parts belonging to each instrument must be grouped together for performing separation, and solutions have thus focused on incorporating constraints into the NMF or Bayesian statistical signal processing framework for enabling this grouping.
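By way of illustration of such a parts-based decomposition, the following minimal sketch (our own, not code from the cited works) factorises a magnitude spectrogram V ≈ WH with the standard multiplicative updates for the generalised Kullback-Leibler divergence:

```python
import numpy as np

def nmf_kl(V, rank, n_iter=200, eps=1e-10, seed=0):
    """Factorise a magnitude spectrogram V (freq x time) as V ~ W @ H.

    W holds the spectral dictionary (parts such as notes, chords or drum
    hits); H holds their time activations. Multiplicative updates for the
    generalised Kullback-Leibler divergence.
    """
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, T)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.T @ np.ones_like(V) + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (np.ones_like(V) @ H.T + eps)
    return W, H
```

Grouping the columns of W (and the corresponding rows of H) by instrument is precisely the step that the constrained NMF variants cited above are designed to automate.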

Other approaches have focused upon exploiting regularities in the sound sources to be decomposed, such as spatial position and repetition or regularity in time and/or frequency. Corresponding algorithms have attempted to separate the sources without resorting to the use of a parts-based representation, and were very successful in specific tasks such as drum sound separation and vocal separation. It was recently observed that the bulk of these algorithms were special cases of a more general framework for designing sound source separation algorithms, termed Kernel Additive Modelling (KAM) by Antoine Liutkus et al in “Kernel Additive Models for Source Separation”, IEEE Transactions on Signal Processing, Vol 62, No. 16, August 2014.
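One well-known special case of KAM is median-filtering-based harmonic/percussive separation; the sketch below is our own illustration of that special case, exploiting the stated regularities (percussive events are broadband and brief, pitched content is narrowband and sustained), not code from the cited paper:

```python
import numpy as np
from scipy.ndimage import median_filter
from scipy.signal import stft, istft

def drum_separation_median(x, fs, nperseg=2048, kernel=17):
    """Median-filtering harmonic/percussive split, a special case of KAM.

    Percussive events span many frequency bins within one frame, so they
    survive a median taken across frequency; pitched partials span many
    frames within one bin, so they survive a median taken across time.
    """
    _, _, X = stft(x, fs, nperseg=nperseg)
    mag = np.abs(X)
    perc = median_filter(mag, size=(kernel, 1))   # median over frequency
    harm = median_filter(mag, size=(1, kernel))   # median over time
    mask = perc**2 / (perc**2 + harm**2 + 1e-12)  # Wiener-like soft mask
    _, drums = istft(X * mask, fs, nperseg=nperseg)
    _, rest = istft(X * (1 - mask), fs, nperseg=nperseg)
    return drums, rest
```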

A number of data processing applications are accordingly known, which embody some of the above-described techniques and allow their user to perform spectral editing. These applications typically require non-trivial data processing resources, in particular high-load, high-performance processors and significant amounts of memory, due to the volume and complexity of calculations to perform upon audio data according to the above-described techniques.

These applications also typically include functionally-comparable audio signal processing algorithms, embodied as tools such as a “magic wand” which highlights the loudest contiguous region under a mouse pointer hovering above a spectrogram rendered in a user interface; a “harmonic magic wand” which selects regions harmonically related to the largest contiguous region under the mouse pointer; and other tools such as “rectangular region selection”, “erasers” and more. Some of these applications also operate using a layers-based paradigm, where changes or selections can be removed from an original audio signal to create a new audio track, which in turn can be further edited. Of such software applications, a product marketed by Audionamix® is known to offer automated sound source separation for a vocal monophonic source and drum separation.

An improved method of decomposing audio signals incorporating multiple audio sources, and a system embodying this method, are desirable for mitigating at least some of the above shortcomings of the known prior art.

BRIEF SUMMARY OF THE INVENTION

The present invention provides, as set out in the appended claims, a distributed method and system for decomposing audio signals incorporating multiple audio sources into audio signals incorporating discrete audio sources, wherein the decomposition is performed automatically in respect of at least some of the audio sources. Techniques introduced in the method and system of the invention involve both NMF and KAM signal processing frameworks adapted with constraints and optimisations to improve separation quality.

According to an aspect of the present invention, there is therefore provided a distributed system for decomposing audio signals including mixed audio sources, comprising at least one client terminal, a remote queuing module and at least one remote audio data processing module connected in a network, wherein each client terminal is programmed to store source audio signal data, select at least one signal decomposition type, upload source audio signal data with data representative of the decomposition type selection to the queuing module, and download decomposed audio signal data; each queuing module is programmed to queue uploaded source audio data and distribute same to one or more audio data processing modules, and to queue uploaded decomposed audio signal data and distribute same to the or each client terminal; and each audio data processing module is programmed to process distributed source audio data into decomposed audio signal data according to the type selection, and upload decomposed audio signal data to the at least one remote queuing module.

In an embodiment of the system, the decomposition type preferably comprises at least one selected from a vocal audio source separation and a drums audio source separation. In an alternative embodiment, a further pan separation may be selectable, which decomposes the source audio signal according to the location of audio sources therein.

Upon a selection of a vocal audio source separation, each audio data processing module preferably processes distributed source audio data for separating at least the vocal audio source therefrom, with a first sequence of algorithms implementing non-negative matrix factorisations. Each client terminal may be further programmed to constrain one or more algorithms of the first sequence with respective variables encoded in the data representative of the decomposition type selection.

Alternatively, upon a selection of a drums audio source separation, each audio data processing module preferably processes distributed source audio data for separating at least the drums audio source therefrom, with a second sequence of algorithms implementing non-negative matrix factorisations. In an alternative embodiment of this system, at least one algorithm of the second sequence may implement a Kernel Additive Modelling technique for processing the distributed source audio data.

In an embodiment of the system, each client terminal may be further programmed to locally process stored source audio signal data with one or more locally-stored decomposition algorithms into edited audio signal data. Usefully, this variant allows some of the less computationally-expensive algorithms to be processed locally, independently of connectivity to the queuing module and by way of preview for the user. A particularly useful embodiment of this variant may implement the at least one KAM algorithm of the second sequence associated with drums track decomposition as a locally-stored decomposition algorithm.

In an embodiment of the system, each client terminal may be further programmed to combine any one or more of stored source audio signal data, downloaded decomposed audio signal data and edited audio signal data into a new audio signal.

According to another aspect of the present invention, there is also provided a computer-implemented method for decomposing a digital audio signal including mixed audio sources in a network, comprising the steps of selecting a source audio signal data and a decomposition type at a client terminal; uploading the source audio signal data and data representative of the selected decomposition type to a queuing module; queuing the uploaded source audio data, and distributing same to an audio data processing module from the queuing module; processing the distributed source audio data into decomposed audio signal data at the audio data processing module with a sequence of algorithms implementing non-negative matrix factorisations, wherein the sequence is determined by the type selection data; uploading the decomposed audio signal data to the queuing module; and queuing the uploaded decomposed data and distributing same to the client terminal from the queuing module.

In an embodiment of the method, the step of selecting a decomposition type preferably comprises selecting at least one selected from a vocal audio source separation and a drums audio source separation.

Upon a selection of a vocal audio source separation, the step of processing the distributed source audio data may comprise separating at least a vocal audio source therefrom, with a first sequence of algorithms implementing non-negative matrix factorisations. Alternatively, upon a selection of a drums audio source separation, the step of processing the distributed source audio data may comprise separating at least a drums audio source therefrom, with a second sequence of algorithms implementing non-negative matrix factorisations.

According to a further aspect of the present invention, there is also provided a set of instructions recorded on a data carrying medium or stored at a network storage medium which, when read and processed by a data processing terminal connected to a network, configures the terminal to perform the steps of embodiments of the method as described herein.

Other aspects are as set out in the claims herein.

BRIEF DESCRIPTION OF THE DRAWINGS

For a better understanding of the invention and to show how the same may be carried into effect, there will now be described, by way of example only, specific embodiments, methods and processes according to the present invention with reference to the accompanying drawings, in which:

FIG. 1 is a hardware diagram of an embodiment of a system according to the invention, within a network environment comprising a plurality of data processing terminals remote from each other and in data communication with each other, wherein the system comprises a client terminal and instances of a server terminal.

FIG. 2 is a simplified diagram of a typical hardware architecture of a client terminal shown in FIG. 1, including a processor and memory means storing a client set of data processing instructions and a source digital audio signal.

FIG. 3 is a simplified diagram of a typical hardware architecture of a client and/or server terminal shown in FIG. 1, including a processor and memory means storing a server set of data processing instructions.

FIG. 4 is a functional diagram of the system shown in FIG. 1, showing the client set of instructions of FIG. 2 in data communication with a remote queuing module and remote instances of a digital audio data processing module embodied by the server set of data processing instructions of FIG. 3.

FIG. 5 is a logical diagram of the contents of the memory means of the client terminal shown in FIGS. 1, 2 and 4, including the client set of instructions, a user interface, the source digital audio signal and an output digital audio signal decomposed according to the invention.

FIG. 6A illustrates steps of a method embodied by the client set of instructions of FIGS. 4 and 5, including steps of uploading the source digital audio signal data to the remote queuing module, selecting a decomposition type and selecting decomposition constraints.

FIG. 6B illustrates sub-steps of the method of FIG. 6A, associated with local spectral editing of a source and/or decomposed digital audio signal.

FIG. 7 is a logical diagram of the contents of the memory means of the server terminal shown in FIGS. 1, 3 and 4, including the queuing module and instances of the audio data processing module.

FIG. 8 illustrates steps of a method embodied by the server set of instructions at the queuing module of FIGS. 4 and 7.

FIG. 9 illustrates steps of a method embodied by the server set of instructions at each instance of the audio data processing module of FIG. 7, including steps of decomposing according to a selected decomposition type and type-dependent constraints.

FIG. 10 illustrates sub-steps of a first embodiment of the decomposition step of FIG. 9, for extracting vocal data from the source audio data.

FIG. 11 illustrates sub-steps of a second embodiment of the decomposition step of FIG. 9, for extracting drums data from the source audio data.

DETAILED DESCRIPTION OF THE DRAWINGS

There will now be described by way of example a specific mode contemplated by the inventors. In the following description, numerous specific details are set forth in order to provide a thorough understanding. It will be apparent however, to one skilled in the art, that the present invention may be practiced without limitation to these specific details. In other instances, well known methods and structures have not been described in detail so as not to unnecessarily obscure the description. As used herein, the expressions ‘audio signal’ and ‘track’ should be understood by the skilled reader as indicating a stereo signal having two channels.

With reference to FIGS. 1 to 4, an example embodiment of a system 100 according to the invention is shown within a networked environment, in which several data processing terminals 101, 102, 103 are connected to a Wide Area Network (WAN) 104, in the example the Internet, through a variety of networking interfaces, wherein network connectivity and interoperable networking protocols of each terminal allow the terminals to connect to one another and to communicate data to, and receive data from, one another according to the methodology described herein.

The system comprises at least one client terminal 101 operated by a user, a first example of which may be a personal computer terminal 101, which is configured to emit and receive data, including audio and/or alphanumerical data, encoded as a digital signal over a wired data transmission 105, wherein said signal is relayed respectively to or from the client terminal 101 by a local router device 106 implementing a wired and/or wireless local network operating according to the IEEE 802.3-2008 Gigabit Ethernet transmission protocol. The router 106 is itself connected to the WAN 104 via a conventional optical fiber connection over a wired telecommunication network 107.

An alternative or additional client terminal is shown at 102 which, in the example, is a portable communication device operated by the same or another user, e.g. a smartphone. The client terminal 102 emits and receives data, including audio and/or alphanumerical data, encoded as a digital signal over a GPRS, 3G or 4G-compliant wireless data transmission 108, wherein the signal is relayed respectively to or from the smartphone 102 by the geographically-closest communication link relay 109_(N) of a plurality thereof. The plurality of communication link relays 109_(N) allows digital signals to be routed between portable devices like the client terminal 102 and their intended recipient by means of a remote gateway 110. The gateway 110 is for instance a communication network switch, which couples digital signal traffic between wireless telecommunication networks, such as the network within which wireless data transmissions 108 take place, and the WAN 104. The gateway 110 further provides layer and communication protocol conversion as required.

The system also comprises at least one server terminal 103. The server terminal 103 is configured to emit and receive data, including audio and/or alphanumerical data, encoded as a digital signal over a wired data transmission 105, wherein said signal is relayed respectively to or from the server terminal 103 by a local router device 106 implementing a wired local network operating according to the IEEE 802.3-2008 Gigabit Ethernet transmission protocol. The router 106 is itself connected to the WAN 104 via a conventional optical fiber connection over a wired telecommunication network 107. In a preferred embodiment shown in FIG. 1, the system comprises a plurality of servers 103_(1-N), across which the data storing and processing tasks described herein with reference to a remote queuing module and a remote audio processing module are shared, and which are presented to any connected client terminal 101, 102 as a unified resource hosted in a ‘cloud’ 114 portion of the WAN 104. In the example, a further 2 server terminals 103_(2,3) are shown connected to the same local router device 106 as the first server terminal 103 for the sake of simplicity; however, it will be readily understood by the skilled reader that these 2 servers 103_(2,3), and any further server terminals 103_(4-N) added to scale up the system's storage and audio data processing capacity, may all be remote from each other and distinctly connected to the WAN 104.

A typical hardware architecture of the smartphone client terminal 102 is shown in FIG. 2 in further detail, by way of non-limitative example. The smartphone 102 includes a data processing unit 201 such as a general-purpose multi-core microprocessor, for instance conforming to the Snapdragon® architecture designed and marketed by Qualcomm®, acting as the main controller of the client terminal 102 and which is coupled with memory means 202, comprising volatile random-access memory (RAM), non-volatile random-access memory (NVRAM) or a combination thereof.

The user terminal 102 further includes networking means. Communication functionality in the smartphone 102 is provided by a modem 203, which provides the interface to external communication systems, such as the GPRS, 3G or 4G cellular telephone network 108 shown in FIG. 1, associated with or containing an analogue-to-digital converter 204, which receives an analogue waveform signal through an aerial 205 from the communication link relay 109_(N) and processes same into digital data with the data processing unit 201 or a dedicated signal processing unit. Alternative wireless communication functionality is provided by a wireless network interface card (WNIC) 206 interfacing the smartphone 102 with wireless local area networks (WLAN), for instance generated by a combined wired-wireless router 106. Further alternative wireless communication functionality may be provided by a High Frequency Radio Frequency Identification (RFID) networking interface implementing Near Field Communication (NFC) interoperability and data communication protocols for facilitating wireless data communication over a short distance with correspondingly-equipped devices.

The CPU 201, NVRAM 202 and networking means 203 to 206 are connected by a data input/output bus 207, over which they communicate and to which further components of the smartphone 102 are similarly connected, in order to provide wireless communication functionality and receive user interrupts, inputs and configuration data. Accordingly, user input may be received from data input interfaces, which for the smartphone 102 typically comprise a limited number of keys or buttons 208 and/or a capacitive or resistive touch screen feature of the display unit 209. Further input data may be received as analogue sound wave data by a microphone 210, digital image data by a digital camera lens 211 and digital data via a Universal Serial Bus (USB) 212. Processed data is output locally as one or both of display data output to the display unit 209 and audio data output to a speaker unit 213. Power is supplied to the above components by the electrical circuit 214 of the device 102, which is fed by an internal battery module 215 interfaced with a power converter 216 suitable for connection to a local mains power source.

A typical hardware architecture of a desktop computer 101 and a server terminal 103_(1-N) is shown in FIG. 3 in further detail, by way of non-limitative example. Each data processing terminal 101, 103 is a computer configured with a data processing unit 301, data outputting means such as a video display unit (VDU) 302, data inputting means such as HID devices, commonly a keyboard 303 and a pointing device (mouse) 304, as well as the VDU 302 itself if it is a touch screen display, and data inputting/outputting means such as the wired network connection 105 to the WAN 104 via the router 106, a magnetic data-carrying medium reader/writer 306 and an optical data-carrying medium reader/writer 307.

Within the data processing unit 301, a central processing unit (CPU) 308 provides task co-ordination and data processing functionality. Sets of instructions and data for the CPU 308 are stored in random access memory means 309, and a hard disk storage unit 310 facilitates non-volatile storage of the instructions and the data. A network interface card (NIC) 311 provides the interface to the network connection 105. A universal serial bus (USB) input/output interface 312 facilitates connection to the HID keyboard and pointing devices 303, 304, besides any further USB-compliant external device or component, for example a portable data storage device (not shown).

All of the above components are connected to a data input/output bus 313, to which the magnetic data-carrying medium reader/writer 306 and optical data-carrying medium reader/writer 307 are also connected. A video adapter 314 receives CPU instructions over the bus 313 for outputting processed data to the VDU 302. All the components of the data processing unit 301 are powered by a power supply unit 315, which receives electrical power from a local mains power source and transforms same according to component ratings and requirements.

As skilled persons will readily understand, in functional terms the respective hardware architectures of the smartphone 102 and the personal computer and server terminals 101, 103 are comparable, with differentiation arising only in respect of the miniaturization, wireless operation and ergonomic handling required for the smartphone 102, relative to components designed for durability and redundancy of operation for the computers 101, 103.

At the client terminal 101, source audio signals encoded on an optical data-carrying medium 317, typically a Compact Disc®, may be read via the optical data-carrying medium reader/writer 307. After converting the source audio signal bitstream read from the disc 317 according to either a lossless encoding format (e.g. Free Lossless Audio Codec, ‘FLAC’) or a lossy encoding format (e.g. MPEG-1 Layer-3, ‘MP3’), source audio signal data may be stored in the HDD 310 as a digital audio signal. Alternatively or additionally, and particularly in the case of the mobile client terminal 102 without optical reading means 307, source audio signals already encoded as lossless or lossy digital audio signal data may be downloaded at the client terminals 101, 102 from a remote resource in the WAN 104 and/or from a local storage device, either in local network connection with the router 106 or in local direct connection via a USB interface 312.

With reference to FIG. 4, the system 100 processes digital audio signals through a distributed data processing architecture comprising a client application 401_(1-N) stored and processed at each client terminal 101, 102, and server-hosted applications, including at least one queuing module 402 and one or more audio data processing modules 403_(1-N). Each client application 401_(1-N) allows its user to load source audio signal data 411, to play back loaded source audio signal data 411 and any edited version(s) thereof, to parameterize and request decompositions, i.e. separation of audio sources in a mixed audio signal, and to manually edit loaded source audio signal data 411 and any edited version(s) thereof using a variety of local audio processing functions.

The queuing module 402 is an always-on data processing resource, which is invoked whenever a user requests a remote decomposition at a respective client terminal 101, 102. The queuing module 402 downloads client-respective source audio signal data 411_(1,2) together with decomposition type-dependent parameters from each requesting client terminal 101, 102, and queues same for processing according to the availability of remote audio processing module(s) 403_(1-N), for instance according to a first-in, first-out buffering principle, variations and optimisations of which may be readily considered based for instance on a source data file size and available client-server network bandwidth.
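A minimal sketch of this first-in, first-out buffering follows, with all class and field names assumed rather than taken from the patent:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class DecompositionJob:
    client_id: str
    audio_uri: str
    decomposition_type: str        # "vocals", "drums" or "pan"
    variables: dict = field(default_factory=dict)

class AudioProcessingModule:
    def __init__(self, module_id: str):
        self.module_id = module_id
        self.current_job = None    # None = idle, awaiting a tasking

    def assign(self, job: DecompositionJob):
        self.current_job = job     # decomposition would start here

class QueuingModule:
    def __init__(self):
        self.pending = deque()     # source audio awaiting a free module
        self.results = deque()     # decomposed audio awaiting client upload

    def enqueue(self, job: DecompositionJob):
        self.pending.append(job)   # first in ...

    def dispatch(self, modules):
        # ... first out, to the first idle audio data processing module
        for m in modules:
            if m.current_job is None and self.pending:
                m.assign(self.pending.popleft())
```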

Each instantiation 403_(1-N) of the audio data processing module in the cloud 114 is likewise an always-on data processing resource, and so receives source audio signal data 411_(1,2) and decomposition type-dependent parameters from the queuing module 402, then performs automatic separations on the received audio to output decomposed audio signal data 423_(1,2). Each instantiation 403_(1-N) of the audio data processing module communicates progress updates 413_(1-N) to the queuing module 402 whilst processing the received source audio data 411; then, upon completing the tasked decomposition, the audio data processing module 403 uploads the decomposed audio signal data 423_(1,2) to the queuing module 402 and becomes available for a next decomposition tasking.

The queuing module 402 is thus further invoked whenever a progress update 413 is received from an audio data processing module 403_(1-N), which the queuing module 402 relays in real time to the corresponding client terminal 101, 102 at which the ongoing decomposition was requested; and whenever decomposed audio signal data 423_(1,2) is received after its decomposition is completed, which the queuing module 402 uploads to the corresponding client terminal 101, 102 at which the completed decomposition was requested, moreover in real time if the relevant client terminal is network-connected at the material time.

With reference to FIG. 5 now, the contents of the memory means 202, 309 of a client terminal in the data processing context of FIGS. 1 and 4 initially include an operating system 501 which, in the case of the desktop client terminal 101, is for instance Windows 10® distributed by Microsoft® of Redmond, Wash. and, in the case of the portable client terminal 102, is for instance iOS® distributed by Apple® Inc. of Sunnyvale, Calif. The OS 501 includes communication subroutines 502 to configure the user terminal 101, 102 for bilateral data communication within the networked environment of FIG. 1.

The memory means next includes a local runtime instantiation of a client audio application 401, interfaced with the OS 501 via one or more Application Programmer Interfaces (APIs) 503, particularly apt to operably interface the client audio application 401 with each of the client terminal's input, display, audio and networking functionalities. Data used and processed by the client audio application 401 primarily includes locally-loaded source audio signal data 411 and downloaded decomposed audio signal data 423.

The client audio application 401 itself comprises a variety of discrete functional modules, including a variety of local audio data-processing algorithms 510, a variety of audio spectrogram-editing tools or ‘widgets’ 520, and a user interface 540 in which to both render spectrograms 542 and read user inputs and selections representative of audio editing choices and task-setting, in particular variables 514 for the automatic decomposition of audio signal data 411.

The client application 401 allows the user to load a source audio track 411. Upon completing this loading, the application 401 allows a user to request generation of a spectrogram 542 in the UI 540, which plots how the frequency content of the loaded audio signal changes over time. The editing of audio signals based on spectrograms is known as spectral editing, which is performed both locally through the suite of audio spectrogram-editing tools 520 associated with local audio data-processing algorithms 510, and remotely through the queuing module 402 and instantiations of the server audio processing module 403_(1-N).

In the UI 540, the user may either edit the source audio signal 411 locally, using the manual editing tools 510, 520, or the user may request one of three types of automatic separations processed remotely at a server 103. The suite of local spectrogram-editing tools 520 comprises known spectrogram-interacting tools, such as “magic wand”, “harmonic wand selection”, “rectangle selection” and “eraser” selection widgets. The editing tools 520 further comprise a “transient” selecting tool, the function of which is to detect the presence of drum hits in a mixed audio signal, and an “amplitude threshold” selecting tool, the function of which is to select elements in a region of a spectrogram that are above a given amplitude threshold.

The local audio data-processing algorithms 510 can be applied to these selections for modifying the audio signal in various ways. A “Spot Removal” algorithm performs a local separation on the loaded audio track by detecting similar regions in the signal and then removing parts of the signal that do not repeat between the detected regions. Depending on the input signal 411 and the context of use, this algorithm can be very effective in removing noise or lead instruments from the mixed signal. Another “Drums removal B” algorithm is based on a Kernel Additive Modelling (KAM) framework and incorporates a number of kernels for separating drums and non-drums from the mixed signal.
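The patent does not give code for “Spot Removal”; the following REPET-SIM-flavoured sketch (all names and parameters are our own assumptions) illustrates the stated idea of detecting similar regions and discarding what does not repeat between them:

```python
import numpy as np
from scipy.signal import stft, istft

def spot_removal(x, fs, n_similar=5, nperseg=2048):
    """For each spectrogram frame, find its most similar frames elsewhere
    in the track and keep only the energy they share (their element-wise
    median); energy that does not repeat between similar regions is removed.
    """
    _, _, X = stft(x, fs, nperseg=nperseg)
    V = np.abs(X)
    # cosine similarity between all pairs of frames
    unit = V / (np.linalg.norm(V, axis=0, keepdims=True) + 1e-12)
    S = unit.T @ unit
    repeating = np.empty_like(V)
    for t in range(V.shape[1]):
        nearest = np.argsort(S[t])[::-1][1:n_similar + 1]  # skip self-match
        repeating[:, t] = np.median(V[:, nearest], axis=1)
    mask = np.minimum(repeating, V) / (V + 1e-12)  # keep the repeating part
    _, y = istft(X * mask, fs, nperseg=nperseg)
    return y
```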

The client application 401 implements a layers-based principle, wherein any edits, whether to the source audio signal 411, to the decomposed audio signal 423 or to an intermediate version of either audio signal, can be exported to a new audio track or layer, for further editing and manipulation in the UI 540. Accordingly, a plurality of spectral edits 1 to N are shown at 530 by way of example, wherein the output of one automatic or local separation 530_(N) can be fed into another separation, e.g. 530_(N−1) or 530_(N+1), and wherein these layers, e.g. 530_(N), 530_(N+1), can also be merged together to create a new composite layer 530_(N+2), which combines user-selected elements and edits of the original track 411. Finally, through the API 503, created layers 530_(1-N) can be exported as digital audio files encoded in a lossless or lossy format for use in other audio applications.

FIGS. 6A and 6B illustrate steps of the main functionality provided by the client set of instructions 401 as described with reference to FIG. 5, at each user terminal 101, 102. When a user switches their terminal on, the OS 501 is initially loaded with its suite of networking protocols 502 at step 601. The user then loads a runtime instantiation of the client application 401 into the memory 309 at step 602, either from local storage 202, 310, or from a remote resource such as a server terminal 103 across the WAN 104, 114. The client application 401 instantiates its graphical user interface 540 onto the terminal display means at step 603, wherefrom the user may select and locally load a source audio signal 411 at step 604. The client application 401 generates a spectrogram 542 of the loaded source audio signal and outputs same to the GUI 540 at step 605.

A question is then asked at step 606, about whether the user wishes to task a remote server application 403 with automatically decomposing the loaded source audio signal 411. When the question of step 606 is answered positively, the user then selects a decomposition type at step 607, from “vocals”, “drums” or “pan”: the respective implementations of “vocals” and “drums” decompositions at the server audio processing module 403 attempt to separate these sources from the source audio signal, whilst the implementation of “pan” separates sources based on their position in the stereo field encoded within the source audio signal.

When the user selects a “vocals” decomposition type, the client application 401 requires the user to input decomposition variables 514, namely: a selection of a threshold frequency for the source audio signal as a first decomposition variable at step 608; the identification of a location for the vocal source within the stereo field of the source audio signal as a second decomposition variable at step 609; a selection of a note activation range, for instance of ±1.5 semitones in the source audio signal, as a third decomposition variable at step 610; and a selection of a time shift value as a fourth decomposition variable at step 611. When the user selects a “pan” decomposition type, the client application 401 only requires the second variable to be input at step 609. The decomposition variables are then used as constraints on the decomposition algorithms, to reduce and refine the possible solutions to the decomposition algorithms. The identification of the vocal position is used to give extra emphasis to notes occurring in this region of the stereo field. The note activation range is used to control how wide a region in frequency is passed surrounding the predominant melody identified in subsequent steps, while the time shift controls how much reverb is captured by the algorithm.
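For concreteness, the decomposition variables 514 read at steps 608 to 611 might be marshalled for upload as a simple key-value structure; the sketch below is purely illustrative and every field name is an assumption, not taken from the patent:

```python
# Hypothetical payload for a "vocals" decomposition request: the four
# variables read at steps 608 to 611 (field names and units assumed).
vocal_request = {
    "decomposition_type": "vocals",
    "threshold_frequency_hz": 180.0,    # step 608: bass/kick cut-off
    "vocal_pan_position": 0.0,          # step 609: centre of the stereo field
    "note_activation_semitones": 1.5,   # step 610: +/- range around the melody
    "time_shift_frames": 4,             # step 611: how much reverb is captured
}

# A "pan" decomposition only needs the stereo-field position (step 609).
pan_request = {"decomposition_type": "pan", "vocal_pan_position": -0.5}
```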

Upon completing the input of decomposition variables 514, or alternatively when the user selects the third “drums” decomposition type, the client application 401 uploads the source audio data 411, together with any decomposition variables data 514 as input, to the queuing module 402 at step 612. The client application 401 then eventually receives periodical updates at step 613, about the progress of the decomposition processed automatically at a remote server audio application 403_(1-N). The client application 401 then eventually downloads the decomposed audio signal data 423 from the queuing module 402 at step 614, upon completion of the separating task at the remote server audio application 403_(1-N).

Alternatively, with reference to FIG. 6B now, when the question of step 606 is answered negatively, the logic proceeds to a further question at step 621, about whether the user wishes to perform a local spectral edit on the spectrogram 542 in the UI 540. When the question of step 621 is answered positively, the client application 401 reads user selection(s) of, and interaction(s) with, a widget 520 in the UI 540 at step 622, then processes the audio data track corresponding to the spectrogram 542 interacted with at step 623.

As the logic described with reference to FIGS. 6A and 6B defines an iterative loop, the spectrogram 542 interacted with at a first instantiation of step 622 and the corresponding audio data track processed at a first instantiation of step 623 relate to the source audio track 411 loaded at a first instantiation of step 604. Subsequent iterations of steps 622 and 623 may relate to any of the loaded source audio track 411, a downloaded decomposed audio track 423 or an intermediate edited track 530. With reference to FIG. 5, the interaction and processing of steps 622, 623 may in particular result in processing the audio track locally with the “spot removal” and/or the “drums removal B” algorithm 510. In any and all cases, at a next step 624 the spectrogram 542 in the GUI 540 is updated according to the output of the processing in the immediately-preceding instantiation of step 623.

Adverting to the description of layers 530 herein, when the question of step 621 is answered negatively, a further question is then asked at step 625, about whether the user wishes to associate one or more spectrogram(s) 542, corresponding to respective runtime layer(s) 530_(1-N), with a new layer 530_(N+1). When the question of step 625 is answered positively, at step 626 the client application 401 reads a user selection of the one or more spectrogram(s) 542_(1-N) in the GUI 540 corresponding to the layer(s) 530_(1-N) of interest, which can correspond to any one or more of the loaded source audio track 411, a downloaded decomposed audio track 423 and an audio track previously edited according to steps 622 to 624. At a next step 627, the client application 401 associates the selected layer(s) 530_(1-N) with a new layer 530_(N+1). At step 628, the client application 401 then declares the new layer 530 to be the current runtime layer, upon which further edits, local or remote, shall be performed, whence control immediately returns to step 622.

Alternatively, when the question of step 625 is answered negatively, then, returning to FIG. 6A now, a further question is asked at step 615, about whether the user wishes to perform a further edit on any of the loaded source audio track 411, a downloaded decomposed audio track 423 or an intermediate edited track 530. When the question of step 615 is answered positively, control returns to the first question of step 606, wherein the user may either locally edit any of the loaded source audio track 411, a downloaded decomposed audio track 423 or an intermediate edited track 530 according to steps 621 to 628, or the user may instead task a remote server application 403 with automatically decomposing any of these track types 411, 423, 530.

Alternatively, the question of step 615 is answered negatively and a question is last asked at step 616, about whether the user wishes to select a new source audio signal 411 for editing purposes. When the question of step 616 is answered positively, control returns to step 604 for a suitable selection and loading. Alternatively, the user may wish to interrupt use of the client application 401 and the question is answered negatively, whence its runtime instantiation may eventually be unloaded from memory 202, 309 and the terminal eventually switched off.

Continuing with the description of the system 100, with reference to FIG. 7 now, the contents of the memory means 309 of a server terminal 103 in the data processing context of FIGS. 1 and 4 initially include an operating system 501, which is for instance Windows Server® distributed by Microsoft® of Redmond, Wash. The OS 501 again includes communication subroutines 502 to configure the server terminal 103 for bilateral data communication within the networked environment of FIG. 1.

The memory means 309 next include the queuing module 402 and, in this embodiment, at least a first instantiation 403_(1) of the server audio processing module 403, both of which modules are interfaced with the OS 501 via one or more Application Programmer Interfaces (APIs) 503, particularly apt to operably interface each module 402, 403 with each other and with the server terminal's input and networking functionalities. It will be readily understood by the skilled person that, subject to scaling of the system 100, other embodiments may have the queuing module 402 distinctly processed at one or more server terminal(s) 103 and instantiations of the server audio processing module 403 at still other server terminal(s) 103, all composing the cloud portion 114 in the WAN 104 that is associated with the system 100.

Data used and processed by the queuing module 402 at runtime includes source audio signal data 411_(1-N) being downloaded from respective remote client applications 401_(1-N) pursuant to client step 612, and decomposed audio signal data 423_(1-N) being downloaded from the or each instantiation 403_(1-N) of the server audio processing module 403, whether local as illustrated in FIG. 7 and/or remote as illustrated in FIGS. 1 and 4, prior to client step 614.

The queuing module 402 further hosts and processes a data structure 714, such as a database, which references all runtime instantiations 403_(1-N) of the server audio processing module 403 composing the cloud portion 114 of the WAN 104, in instantiation-respective records 724_(1-N) that store data representative of an instantiation's network addressing particulars, of source audio data uploaded thereto for decomposition, of decomposed audio data downloaded therefrom, and of decomposition tasking status. The database further references all client applications 401_(1-N) connected to the system 100 from respective client terminals 101, 102, in client terminal-respective records 734_(1-N) that store data representative of a client terminal's network addressing particulars, of source audio data downloaded therefrom, and of decomposed audio data uploaded thereto. The database further references progress updates 413_(1-N) received from server audio processing module instantiations 403_(1-N) and forwarded to respective client applications 401_(1-N), with network addressing reconciled from instantiation records 724_(1-N) and client terminal records 734_(1-N).
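The record contents enumerated above suggest shapes along the following lines; this is a hypothetical sketch, with every field name assumed rather than taken from the patent:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InstantiationRecord:  # one record 724 per processing module 403_(1-N)
    network_address: str
    source_audio_ref: Optional[str]      # audio uploaded to it for decomposition
    decomposed_audio_ref: Optional[str]  # audio downloaded back from it
    tasking_status: Optional[str]        # None (nil) means idle and taskable

@dataclass
class ClientRecord:         # one record 734 per client application 401_(1-N)
    network_address: str
    source_audio_ref: Optional[str]      # audio downloaded from the client
    decomposed_audio_ref: Optional[str]  # audio uploaded back to the client
```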

Data used and processed by the or each instantiation 403_(1-N) of the server audio processing module 403 at runtime includes data representative of a source audio signal 411 and decomposition type data 514, including optionally, in case the tasking is a “vocals” type of decomposition, decomposition variables data, all downloaded from the queuing module 402 pursuant to a tasking. Data output by the or each instantiation 403_(1-N) to the memory 309 includes decomposed audio signal data 423 ready for uploading to the queuing module 402. The or each instantiation 403_(1-N) further comprises decomposing logic 760, comprising a plurality of audio data processing algorithms variously used according to the type of decomposition selected, and further described with reference to FIGS. 10 and 11 herein.

FIG. 8 illustrates steps of the main functionality provided by the queuing module 402 as described with reference to FIGS. 1 to 7, at a server terminal 103. Initially, a question is asked at step 801, about whether a decomposition request has been received from a remote client application 401_(1-N) pursuant to step 612. When the question of step 801 is answered positively, the queuing module 402 first validates the request by identifying the source audio data type, the decomposition type requested and the presence of decomposition data associated with a “vocal” decomposition at step 802. At a next step 803, the queuing module generates a client record 734 for the received request in the database 714, to reference the source audio signal 411 and the requesting client application 401 therein. At a next step 804, the queuing module 402 begins to download the source audio data 411 from the requesting client terminal 101, 102.

In parallel with initiating the download step 804, the queuing module 402 checks the instantiation records 724_(1-N) in the database 714 at step 805, for the first instantiation 403 having a decomposition tasking status with a nil value, representative of a server audio data processing module 403 awaiting a next source audio data 411 to decompose. Upon identifying a waiting instantiation, the queuing module 402 then generates a new instantiation record 724 for the received client request in the database 714 at step 806, to reference the downloading source audio signal 411 and the tasked server application 403 therein. At a next step 807, the queuing module 402 begins to upload the source audio data 411 to the tasked server terminal 103.

When completing the uploading or transferring of the source audio data 411, or when the question of step 801 is answered negatively, control proceeds to a second question at step 808, about whether the queuing module 402 has received an indication that a decomposition has been completed at a tasked server audio data processing module 403_(1-N).

When the question of step 808 is answered positively, the queuing module 402 first matches the received indication with the corresponding server record 724 in the database 714 at step 809, then flushes the downloaded source audio data 411, along with its associated decomposition type and parameters data 514, corresponding to the source audio data 411 referenced in the matched server record 724, from local or remote queuing module storage at step 810, before beginning to download the decomposed audio data 423 from the tasked server terminal 103 at step 811.

In parallel with initiating the download step 811, the queuing module 402 matches the flushed downloaded source audio data 411 with the corresponding client record 734 in the database 714 at step 812. Upon establishing a network connection to the matched client terminal 101, 102, at a next step 813 the queuing module 402 begins to upload the decomposed audio data 423 thereto.

When completing the uploading of the decomposed audio data 423, or when the question of step 808 is answered negatively, control proceeds to a third question at step 814, about whether the queuing module 402 has received a progress update 413 from a tasked server audio data processing module 403_(1-N). When the question of step 814 is answered positively, the queuing module 402 matches the tasked server record 724 associated with the received update 413 with the corresponding tasking client record 734 in the database 714 at step 815, then forwards the progress update 413 to the matched client application 401 at step 816. Control then returns to the original question of step 801, likewise when the question of step 814 is answered negatively.

FIG. 9 illustrates steps of the main functionality provided by each server audio data processing module 403_(1-N) as described with reference to FIGS. 1 to 8, at a server terminal 103. At a first step 901, the module 403 receives source audio signal data 411 and decomposition parameterizing data 514, comprising at least data representative of the tasked decomposition type, i.e. “vocal”, “drums” or “pan” and, if the tasked decomposition type is “vocal”, algorithm constraints data originally input by the user at steps 608 to 611 of the client application 401. A question is next asked at step 902, about whether the tasked decomposition type is indeed “vocal” or not. When the question of step 902 is answered positively, the module 403 constrains each of the parameterisable algorithms 760 in a first sequence thereof, that are involved in the vocal separation data processing and described in further detail with reference to FIG. 10, with the received user input 514 at step 903. The module 403 next processes the mixed-source source audio data 411 with the first sequence of algorithms at step 904 to filter the vocal source, and outputs at least the extracted vocal audio track as a first decomposed audio data 423 at step 905. In an alternative embodiment, also shown in FIG. 9, the processing of step 904 further outputs an instrumental audio track omitting the extracted vocal audio track as a second decomposed audio data 423 at step 906.

Alternatively, when the question of step 902 is answered negatively, the module 403 next processes the mixed-source source audio data 411 with a second sequence of algorithms, that are involved in the percussion separation data processing and described in further detail with reference to FIG. 11, at step 914. The second sequence of algorithms is designed to filter percussion source(s) from the mixed-source audio signal data 411, thus the module 403 outputs at least the extracted drums audio track as a first decomposed audio data 423 at step 915. In an alternative embodiment, also shown in FIG. 9, the processing of step 914 further outputs a pitched-sources audio track omitting the extracted drums audio track as a second decomposed audio data 423 at step 916.

In another embodiment of the server audio processing module 403 accommodating a “pan” decomposition type (not shown), an intermediary question is asked before step 914, as to whether the tasked decomposition type is “drums” or not, which, when answered positively, proceeds to step 914 but, when answered negatively, causes the module 403 to next process the mixed-source source audio data 411 with a single pan-based audio separation algorithm, described in further detail with reference to FIG. 10, and to eventually output two decomposed audio signals 423, one for the strongest audio source in the stereo field and the other for the original mixed-source audio signal omitting the extracted strongest audio source. The pan-based algorithm recovers more than 2 sources, the number of which can be user-defined but not less than 3, which makes it different from the other separation algorithms. It can be configured with two modes of operation: the first uses a pre-defined set of equally-spaced directions, where the algorithm decomposes the input mixture signal into a set of sources emanating from the pre-defined directions; the second is where the user chooses the number of directions, and the algorithm learns the directions associated with the user-chosen number of sources. The second mode is considerably more computationally intensive than the first mode, which can operate in less than real-time.

Further to outputting any of the decomposed audio signal data 423 at any of steps 905, 906, 915, 916 and per the alternative embodiment described immediately above, control invariably proceeds to step 920, at which the module 403 sends the indication of completion, then uploads the decomposed audio signal 423 to the queuing module 402. At step 921, the module 403 flushes the source audio data 411 and decomposition data 514 downloaded at the previous instantiation of step 901 from its cache, and sends a status message to the queuing module 402 for updating its decomposition tasking status in its record 724, whence it may then be tasked again by the queuing module 402 in due course. The module 403 thus enters a waiting state at a next step 922, for awaiting a next source audio data signal.

The step 904 of processing the source digital audio signal to extract and output a decomposed vocal track is now described in further detail with reference to FIG. 10. The main logic of the vocal separation process is based on a non-negative matrix factorisation framework, wherein estimates are generated of the pitches and other events, as well as the time-varying timbres of these events, in the source signal. The predominant or vocal melody in the source audio signal 411 is determined according to these estimates, and used as the basis to generate filters that are then applied to recover the vocal track from the source audio signal 411. How the predominant melody/vocal melody is determined from the estimates of pitches and other events is detailed below. ‘Other events’ in this case includes the occurrence of non-pitched sounds in the audio mixture, potentially including drums and percussion as well as non-pitched vocal sounds such as consonants, plosives and fricatives. Timbre, the character or quality of a musical sound or voice as distinct from its pitch and intensity, can in this case be taken to mean the time-varying frequency content of an event. Once the predominant melody has been estimated, it can then be used to estimate a spectrogram of the vocal melody. This estimate is then used to create a filter, such as a Wiener filter or other suitable type of filter, which is then applied to the original complex audio spectrogram before inversion to the time domain audio signal.
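As a hedged illustration of this final filtering stage (names and the exact mask are our own; the patent only specifies “a Wiener filter or other suitable type of filter”), a magnitude-squared Wiener-style mask might be applied as follows:

```python
import numpy as np
from scipy.signal import stft, istft

def wiener_recover(x, fs, vocal_mag_est, backing_mag_est, nperseg=2048):
    """Build a Wiener-style mask from estimated magnitude spectrograms
    (which must match the STFT grid of x), apply it to the complex mixture
    spectrogram, then invert to the time domain.
    """
    _, _, X = stft(x, fs, nperseg=nperseg)
    mask = vocal_mag_est**2 / (vocal_mag_est**2 + backing_mag_est**2 + 1e-12)
    _, vocals = istft(X * mask, fs, nperseg=nperseg)
    return vocals
```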

A first ‘bass/kick reduction’ algorithm is applied at step 1001. The input source audio signal 411 is initially filtered to remove everything above the frequency defined by the user with the remote client application 401 at step 608. This frequency is set as high as possible while still ensuring that no significant vocal energy can be heard in the filtered signal. A spectrogram of the remaining signal is then analysed using non-negative matrix factorisation, using a fixed number of basis functions to learn a spectral dictionary and the time-activations associated with the spectral dictionary. A spectrogram of the full input signal is then obtained, and a constrained non-negative matrix factorisation is then performed on that second spectrogram, recovering a second spectral dictionary and its associated time activations. A subset of the recovered dictionary is forced to have the same time activations as those learned from the previous step. This allows recovery, and removal through filtering, of higher-frequency energy which is primarily associated with events that have their main energy below the user-defined frequency from the previous step. The vocal signal is thus easier to identify and recover in the remaining signal, by generating two distinct signals which respectively contain drums and non-drum elements.
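A minimal sketch of the constrained second factorisation described above, assuming the activations H_fixed have already been learned from the low-passed signal with an unconstrained NMF (as in the earlier sketch); all names are illustrative:

```python
import numpy as np

def constrained_nmf(V_full, H_fixed, extra_rank, n_iter=200, eps=1e-10, seed=0):
    """Second-stage factorisation sketch for the bass/kick reduction.

    The first rows of the activation matrix are clamped to H_fixed, the
    activations learned from the low-passed signal, so that the matching
    dictionary atoms in W capture the full-band energy of those events.
    Only the free activations (H_free) and the dictionary W are updated.
    """
    rng = np.random.default_rng(seed)
    F, T = V_full.shape
    k = H_fixed.shape[0]
    W = rng.random((F, k + extra_rank)) + eps
    H_free = rng.random((extra_rank, T)) + eps
    for _ in range(n_iter):
        H = np.vstack([H_fixed, H_free])
        WH = W @ H + eps
        H_free *= (W[:, k:].T @ (V_full / WH)) / (
            W[:, k:].T @ np.ones_like(V_full) + eps)
        H = np.vstack([H_fixed, H_free])
        WH = W @ H + eps
        W *= ((V_full / WH) @ H.T) / (np.ones_like(V_full) @ H.T + eps)
    return W, np.vstack([H_fixed, H_free])
```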

A second ‘pan-based separation’ algorithm is next applied at step 1002. This algorithm is based on the assumption that the lead vocal source usually has its energy coming from a specific position in the stereo field. The algorithm incorporates a spatial model into a non-negative matrix factorisation framework, which allows separation of sources in stereo signals based on their direction in the stereo field, wherein the position of the lead vocal source in the stereo field is identified by the user with the remote client application 401 at step 609. The source audio signal data 411 is thus processed to filter out energy which is not coming from the spatial region associated by the user with the vocal source.
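The spatial model itself is embedded in the NMF framework and is not given in code; as a simplified stand-in, a binary pan mask built from the inter-channel level ratio illustrates the idea of keeping only energy near the user-identified position of step 609 (all names and parameters assumed):

```python
import numpy as np
from scipy.signal import stft, istft

def pan_mask(xl, xr, fs, pan_target=0.0, width=0.2, nperseg=2048):
    """Keep only time-frequency bins whose inter-channel level ratio places
    them near the user-identified stereo position; everything else is
    filtered out before the subsequent melody estimation.
    """
    _, _, L = stft(xl, fs, nperseg=nperseg)
    _, _, R = stft(xr, fs, nperseg=nperseg)
    # panning index in [-1, 1]: -1 hard left, 0 centre, +1 hard right
    pan = (np.abs(R) - np.abs(L)) / (np.abs(L) + np.abs(R) + 1e-12)
    keep = np.abs(pan - pan_target) < width
    _, yl = istft(L * keep, fs, nperseg=nperseg)
    _, yr = istft(R * keep, fs, nperseg=nperseg)
    return yl, yr
```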

Having removed low-frequency energy and energy coming from directions not associated with the vocal source from the source audio signal data 411 at steps 1001 and 1002 respectively, a third ‘melody estimation’ algorithm is next applied at step 1003. This algorithm initially generates a variable Q spectrogram of the source audio signal data 411, which is then factorised using a shift-invariant non-negative matrix factorisation, wherein pitched notes are constrained to have harmonic patterns. These patterns are created through time-varying weighted combinations of harmonic templates, the weights of which are learned during the factorisation process. These weighted harmonic templates are convolved with note activations, which are also learned during the factorisation process, to estimate a model of the variable Q spectrogram. Non-harmonic information is modelled using standard non-negative matrix factorisation. The note activation functions can be randomly initialised and then updated in an iterative manner, using a suitable cost function such as the generalised Kullback-Leibler divergence as a measure of fit between the original spectrogram and that estimated by the decomposition process. Once a sufficient number of iterations has been performed and the iterative algorithm has converged, the note activations are analysed using a variant of the Viterbi algorithm, to determine the predominant pitch or melody in the signal, which will typically be the lead vocal source or another solo instrument in the mixed audio signal. The melody path is determined from the amplitudes of the note activation functions, in conjunction with constraints on the likelihood of large jumps in the pitch of the predominant melody, as well as a constraint which encourages the temporal continuity of the predominant melody.
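The Viterbi-like melody search can be illustrated with the following sketch, in which a hypothetical `track_melody` function maximises the summed log note activations while penalising large pitch jumps between frames; the linear jump penalty is an illustrative stand-in for the likelihood constraints described above.

```python
import numpy as np

def track_melody(A, jump_penalty=0.5):
    """Viterbi-style search for the predominant melody through a note
    activation matrix A (n_notes x n_frames): maximise the summed
    log-activation along a path while penalising large pitch leaps."""
    n_notes, n_frames = A.shape
    logA = np.log(A + 1e-12)
    score = np.empty((n_notes, n_frames))
    back = np.zeros((n_notes, n_frames), dtype=int)
    score[:, 0] = logA[:, 0]
    jumps = np.abs(np.arange(n_notes)[:, None] - np.arange(n_notes)[None, :])
    trans = -jump_penalty * jumps              # discourage big pitch leaps
    for t in range(1, n_frames):
        cand = score[:, t - 1][None, :] + trans
        back[:, t] = np.argmax(cand, axis=1)   # best predecessor per note
        score[:, t] = np.max(cand, axis=1) + logA[:, t]
    path = np.zeros(n_frames, dtype=int)
    path[-1] = np.argmax(score[:, -1])
    for t in range(n_frames - 1, 0, -1):       # backtrace the melody path
        path[t - 1] = back[path[t], t]
    return path
```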

Once the predominant melody has been determined at step 1003, a fourth ‘vocal separation’ algorithm is next applied at step 1004. Substantially the same factorisation process as used for the melody estimation of step 1003 is repeated, except that all note activations of the predominant melody at a given time frame that are outside the range (of typically plus or minus 1.5 semitones) defined by the user with the remote client application 401 at step 610 are set to zero. A standard non-negative matrix factorisation is used to model all non-melody notes and events, which results in two estimated spectrograms, one for vocals audio data and one for backing track audio data. These estimated spectrograms are then used to filter the variable Q spectrogram before inversion of the separated vocal and backing track signals to the time domain.
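The zeroing of out-of-range activations is simple to express; a minimal sketch follows, assuming one activation row per semitone and hypothetical names throughout.

```python
import numpy as np

def constrain_to_melody(H, melody, semitone_range=1.5, bins_per_semitone=1):
    """Zero all note activations further than `semitone_range` from the
    estimated predominant melody in each frame, before re-running the
    factorisation to model the vocal part."""
    H = H.copy()
    n_notes, n_frames = H.shape
    half = semitone_range * bins_per_semitone
    for t in range(n_frames):
        keep = np.abs(np.arange(n_notes) - melody[t]) <= half
        H[~keep, t] = 0.0
    return H
```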

The resulting vocal and backing track signals recovered after step 1004 typically have artefacts and noise present. Some of these artefacts are a result of inconsistencies in how the vocal signal and/or the backing track has been modelled in the individual channels of the signal. In order to at least reduce, and optimally remove, these artefacts, a fifth ‘spatial modelling’ algorithm is next applied at step 1005. In this algorithm, the vocal and backing track signals separated at step 1004 are transformed to the time-frequency domain using a short-time Fourier transform (STFT), and spatial modelling is performed to identify a coherent stereo model for the lead vocal and a distinct coherent stereo model for the backing track. This is done by taking the STFT of each channel in the vocal signal and projecting the channels against each other in a number of different directions, so that phase cancellation occurs.
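The set of projections might be formed as in the following sketch, where a source panned at a given angle phase-cancels in the corresponding projection; the angle grid and the number of projections are illustrative choices.

```python
import numpy as np

def projection_tensor(L_stft, R_stft, n_proj=30):
    """Project the two channel STFTs against each other at a set of
    angles, so that a source panned at a given angle phase-cancels,
    and stack the projection magnitudes into an F x T x P tensor."""
    thetas = np.linspace(0.0, np.pi / 2, n_proj)
    projs = [np.abs(np.cos(th) * L_stft - np.sin(th) * R_stft)
             for th in thetas]
    return np.stack(projs, axis=-1)            # shape (F, T, P)
```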

The resulting projections are aggregated into a tensor, and the tensor is factorised to yield a single spectrogram which best fits the vocal signal, and a spatial direction activation vector, which describes from which direction in the stereo field the estimated spectrogram is coming. The backing track then undergoes the same processing method, whereby the coherent spatial models obtained are then used to filter the source audio data signal 411, leading to recovered vocal and backing tracks 905, 906 that contain fewer artefacts than were obtained at the end of step 1004. The tensor can be factorised using a non-negative matrix factorisation approach. The tensor is of size F×T×P, where F is the number of frequency bins, T is the number of time frames, and P is the number of projections. This is flattened to a matrix of size (F×T)×P. This matrix is then factorised as X=AS, where A is a vector of size (F×T)×1 containing a flattened spectrogram for the source, and S is a spatial activation vector of size 1×P. Both A and S are randomly initialised and iteratively updated using a suitable cost function, such as the generalised Kullback-Leibler divergence, in the manner of standard non-negative matrix factorisation algorithms. Once the factorisation is completed, A is then reshaped to a single spectrogram of size F×T. The estimated single source spectrogram and associated spatial activation vector can then be used to construct a new estimate of the source tensor. This process is performed for the vocal tensor and the backing track tensor. These new estimates are then used to create a suitable filter, such as a Wiener filter, which is applied to the original audio mixture.
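The rank-1 factorisation X=AS of the flattened tensor may be sketched as follows, using multiplicative updates for the generalised Kullback-Leibler divergence as mentioned above; the function name and iteration count are illustrative.

```python
import numpy as np

def rank1_kl_factorise(T_ftp, n_iter=100, eps=1e-12):
    """Flatten the F x T x P projection tensor to X of size (F*T) x P
    and factorise X ~ A S, where A is the flattened source spectrogram
    and S the spatial activation over projection directions, using
    multiplicative KL-divergence updates."""
    F, T, P = T_ftp.shape
    X = T_ftp.reshape(F * T, P)
    A = np.abs(np.random.rand(F * T, 1))
    S = np.abs(np.random.rand(1, P))
    ones = np.ones_like(X)
    for _ in range(n_iter):
        V = A @ S + eps
        A *= ((X / V) @ S.T) / (ones @ S.T + eps)   # update spectrogram
        V = A @ S + eps
        S *= (A.T @ (X / V)) / (A.T @ ones + eps)   # update spatial vector
    return A.reshape(F, T), S
```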

Steps 1003, 1004 and 1005 assume that the vocal energy primarily comes from a single direction in the stereo field. While this is usually the case, artificial reverberation is typically added during the mixing process to add a sense of space to the recordings. It is quite common that this reverb will come from a different direction to that of the original vocal, and so is not captured by the algorithms described above. To compensate for this, a sixth ‘reverb modelling’ algorithm is applied at step 1006.

In this algorithm, the spectrogram of the vocal signal is cross-correlated with a spectrogram of the backing track signal. This cross-correlation is performed up to a time shift value defined by the user with the remote client application 401 at step 611, and the correlation coefficients obtained are used to identify the strength of the vocal reverb remaining in the backing track. Shifted versions of the vocal spectrogram, scaled in accordance with the correlation coefficients, are then added to the original vocal spectrogram to create an improved vocal spectrogram. This improved vocal spectrogram and the backing track spectrogram are then used to filter the source audio signal data 411, yielding an improved vocal signal containing the vocal reverb. The incorporation of the vocal reverb also has the effect of masking many of the remaining artefacts in the separated vocal signal, resulting in decomposed vocal audio data 423 with high audio quality. The separately-decomposed backing track audio data also has considerably less lead vocal source data remaining therein.
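A minimal sketch of this reverb modelling step follows, assuming magnitude spectrograms and using the Pearson correlation at each time shift as the scaling coefficient; the shift limit corresponds to the user-defined value from step 611, and all names here are hypothetical.

```python
import numpy as np

def add_reverb_tail(V_voc, V_back, max_shift=20):
    """Estimate how strongly time-shifted copies of the vocal
    spectrogram remain in the backing track, then fold those shifted,
    scaled copies back into the vocal spectrogram."""
    improved = V_voc.copy()
    for d in range(1, max_shift + 1):
        a = V_voc[:, :-d].ravel()
        b = V_back[:, d:].ravel()
        c = np.corrcoef(a, b)[0, 1]            # correlation at lag d
        c = max(c, 0.0)                        # ignore negative correlation
        improved[:, d:] += c * V_voc[:, :-d]   # shifted, scaled vocal copy
    return improved
```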

In an alternative embodiment of the system 100, the client application 401 may provide its user with the ability to select the exporting of the vocal reverb audio data determined at step 1006, separately from the decomposed vocal audio data 423 and from the decomposed backing track audio data, wherein this selection is input as a further parameterising decomposition variable 514.

The step 914 of processing the source digital audio signal to extract and output a decomposed percussion or ‘drums’ track is now described in further detail with reference to FIG. 11. This logic is again based around a non-negative matrix factorisation framework, and incorporates a number of constraints in time and frequency on the factorisation, in order to generate two distinct signals which respectively contain drum and non-drum audio sources.

The ‘drum’ decomposition logic initially relies upon three distinct ‘drums’ separation algorithms, the respective outputs of which are then aggregated in a specific manner. Each ‘drums separation’ algorithm generates a spectrogram of the source audio signal data 411 via a respective Short-Time Fourier Transform, and generates estimated spectrograms of the separated drums audio sources, i.e. percussion instruments, and pitched audio sources, i.e. musical instruments.

Accordingly, the source audio signal 411 downloaded from the queuing module 402 is input in parallel to the first ‘drums separation A’ algorithm at step 1101, to the second ‘drums separation B’ algorithm at step 1102 and to the third ‘drums separation C’ algorithm at step 1103.

The first ‘drums separation A’ algorithm of step 1101 is a non-negative matrix factorisation-based algorithm, with additional constraints on the spectral dictionaries to be learned and the time-activations to be learned. For the drum basis functions, the constraints force the algorithm to learn smooth spectral dictionary elements and transient-like time-activations, to reflect the fact that percussive sounds are broadband noise, whereas their occurrences are transient in nature. Conversely, for pitched instruments in the source audio data 411, the spectral dictionary is forced to be spiked or sparse in nature, whilst the time-activations are constrained to be smooth, reflecting the fact that most notes played by pitched instruments are sustained. Whereas factorisation is usually performed either using a linear spectrogram or a log-frequency spectrogram, in this algorithm the mapping from the linear domain to the log-frequency domain is incorporated into the algorithm, so that the reconstruction accuracy of the factorisation is measured in the linear domain, whereas the constraints are simultaneously enforced in the log-frequency domain. This technique provides improved results as regards removing vocal source interference from the drum audio signal. In this case, dictionary spectral smoothness is imposed by adding a continuity constraint to ensure that the difference between two adjacent frequency bins is not too large, while still trying to ensure a good fit between the actual source spectrogram and the estimated spectrogram. Transient-like time activations are learned by imposing a constraint that the difference between successive time activations is as large as possible, while still trying to ensure a good fit between the actual source spectrogram and the estimated spectrogram.
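These two soft constraints can be written down as simple penalty terms, sketched below; in a full implementation the penalties would be folded into the non-negative matrix factorisation cost function alongside the reconstruction error, rather than evaluated in isolation, and the names here are illustrative.

```python
import numpy as np

def spectral_smoothness(W_drum):
    """Penalty encouraging smooth drum dictionary atoms: the sum of
    squared differences between adjacent frequency bins, so smaller
    values mean smoother (broadband noise-like) spectra."""
    return np.sum(np.diff(W_drum, axis=0) ** 2)

def transientness(H_drum):
    """Negative of the summed squared differences between successive
    time activations: minimising this term pushes activations towards
    sharp, transient-like onsets."""
    return -np.sum(np.diff(H_drum, axis=1) ** 2)
```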

The second ‘drums separation B’ algorithm of step 1102, which is also incorporated in the client application 401 as a local audio processing tool 510, is based on a Kernel Additive Modelling framework and is a fast drum separation algorithm implementing the principle that drums can be regarded as vertical ridges in spectrograms, whereas pitched instruments can be regarded as horizontal ridges. This is achieved by choosing a suitable kernel for the kernel additive modelling framework. For example, a kernel of size Fk×1, where Fk is the number of frequency bins in the kernel (Fk=17 for example) and 1 is the number of time frames, encourages structures which have continuity in frequency, or in other words, a vertical ridge in the spectrogram. Similarly, a kernel of size 1×Tb, where 1 is the number of frequency bins in the kernel and Tb is the number of time frames in the kernel (Tb=17 for example), encourages structures which have continuity in time, or in other words, a horizontal ridge in the spectrogram.
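In its simplest form, such a kernel choice amounts to median filtering the mixture spectrogram with a vertical kernel and a horizontal kernel; the sketch below illustrates the technique under that simplifying assumption, with hypothetical names and a soft Wiener-style mask.

```python
import numpy as np
from scipy.ndimage import median_filter

def kam_drum_separation(S_mag, f_k=17, t_b=17):
    """Kernel additive modelling with a vertical (f_k x 1) kernel for
    drums and a horizontal (1 x t_b) kernel for pitched content,
    realised here as median filtering, followed by soft masking."""
    perc = median_filter(S_mag, size=(f_k, 1))   # vertical ridges: drums
    harm = median_filter(S_mag, size=(1, t_b))   # horizontal ridges: pitched
    eps = 1e-12
    drum_mask = perc**2 / (perc**2 + harm**2 + eps)
    return drum_mask * S_mag, (1.0 - drum_mask) * S_mag
```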

The third ‘drums separation C’ algorithm of step 1103 enforces smoothness of the drum basis functions, by restricting them to be composed of sums of Hann windows of 1 octave width in frequency, and then performing non-negative matrix factorisation using this dictionary.
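A fixed dictionary of this kind might be constructed as in the sketch below, assuming a log-frequency axis with a known number of bins per octave; the half-overlap spacing of the windows is an illustrative choice, not prescribed by the description above.

```python
import numpy as np

def hann_octave_dictionary(n_bins, bins_per_octave=12):
    """Fixed drum dictionary: Hann windows one octave wide, stepped
    along a log-frequency axis with half overlap; drum atoms are then
    constrained to be non-negative sums of these smooth bumps."""
    width = bins_per_octave                  # one octave in log-f bins
    atoms = []
    for start in range(0, n_bins - width, width // 2):
        atom = np.zeros(n_bins)
        atom[start:start + width] = np.hanning(width)
        atoms.append(atom)
    return np.stack(atoms, axis=1)           # shape (n_bins, n_atoms)
```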

The drum spectrograms respectively output by each of the A, B and C drums algorithms are then combined at step 1104, by taking the elementwise median of the spectrograms. The process is repeated for the pitched instruments signal. The resulting drum and pitched spectrograms are then further improved by a further median filtering stage, wherein the output of the first ‘drums separation A’ algorithm of step 1101 is replaced by the result from the first median filtering stage, and median filtering is again performed across all three A, B and C drums algorithms.
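The two-stage median combination can be sketched as follows, assuming the three algorithms' drum spectrogram estimates share a common time-frequency grid; the same routine would be applied to the pitched instrument estimates.

```python
import numpy as np

def combine_drum_estimates(S_a, S_b, S_c):
    """Two-stage combination: take the elementwise median of the three
    drum spectrogram estimates, substitute it for algorithm A's output,
    then take the median again across the three estimates."""
    S_med = np.median(np.stack([S_a, S_b, S_c], axis=0), axis=0)
    return np.median(np.stack([S_med, S_b, S_c], axis=0), axis=0)
```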

The resulting drums signal and pitched instrument signal recovered after step 1104 typically still have artefacts and noise present. Some of these artefacts are a result of inconsistencies in how the drums signal and/or the pitched instrument track has been modelled in the individual channels of the signal. In order to reduce and remove these artefacts, the decomposed signals are transformed to the time-frequency domain using a short-time Fourier transform (STFT), then processed with the same ‘spatial modelling’ algorithm of ‘vocal’ separation step 1005, to identify a coherent stereo model for the drums signal and a separate coherent stereo model for the pitched instruments signal. This approach results in decomposed drums audio data 423 with high audio quality. The separately-decomposed backing track audio data also has considerably less percussion source data remaining therein.

The present invention thus provides a distributed audio editing system 100 apt to automatically decompose audio mixtures 411 into audio signals with discrete audio source(s) 423, at comparatively little data storage and processing expense for a client terminal 101, 102. The system of the invention is relevant to a wide variety of audio data processing contexts, involving variously the music industry, post-production in the broadcast and film industries, as well as others such as audio forensics. The system of the invention may find ready applications such as the mixing and remixing of archival material, music repurposing, content generation such as instrumental backing tracks, upmixing of stereo tracks to surround sound formats, de-noising and repair of flawed recordings, the elimination of unwanted sounds in recordings, and the removal of spill or bleed from adjacent instruments which have been recorded in a live setting, besides generally allowing increased creativity in musical composition and sound design.

In the specification, the terms “comprise, comprises, comprised and comprising” or any variation thereof and the terms “include, includes, included and including” or any variation thereof are considered to be totally interchangeable and they should all be afforded the widest possible interpretation and vice versa. The invention is not limited to the embodiments hereinbefore described but may be varied in both construction and detail.

What is claimed is:
 1. A distributed system for decomposing audio signals including mixed audio sources, comprising at least one client terminal, a remote queuing module and at least one remote audio data processing module connected in a network, wherein each client terminal is programmed to store source audio signal data, select at least one signal decomposition type, upload source audio signal data with data representative of the decomposition type selection to the queuing module, and download decomposed audio signal data; each queuing module is programmed to queue uploaded source audio data and distribute same to one or more audio data processing modules and queue uploaded decomposed audio signal data and distribute same to the or each client terminal; and each audio data processing module is programmed to process distributed source audio data into decomposed audio signal data according to the type selection, and upload decomposed audio signal data to the at least one remote queuing module.
 2. The distributed system of claim 1, wherein the decomposition type comprises at least one selected from a vocal audio source separation and a drums audio source separation.
 3. The distributed system of claim 1, wherein each audio data processing module processes distributed source audio data for separating at least the vocal audio source therefrom, with a first sequence of algorithms implementing non-negative matrix factorisations.
 4. The distributed system of claim 3, wherein each client terminal is further programmed to constrain one or more algorithms of the first sequence with respective variables encoded in the data representative of the decomposition type selection.
 5. The distributed system of claim 1, wherein the decomposition type further comprises a separation of an audio source location within the source audio signal.
 6. The distributed system of claim 1, wherein each audio data processing module processes distributed source audio data for separating at least the drums audio source therefrom, with a second sequence of algorithms implementing non-negative matrix factorisations.
 7. The distributed system of claim 6, wherein at least one algorithm of the second sequence implements a Kernel Additive Modelling (KAM) technique for processing the distributed source audio data.
 8. The distributed system of claim 1, wherein each client terminal is further programmed to locally process stored source audio signal data with one or more locally-stored decomposition algorithms into edited audio signal data.
 9. The distributed system of claim 8, wherein the at least one algorithm of the second sequence implementing a Kernel Additive Modelling technique is a locally-stored decomposition algorithm.
 10. The distributed system of claim 1, wherein each client terminal is further programmed to combine any one or more of stored source audio signal data, downloaded decomposed audio signal data and edited audio signal data into a new audio signal.
 11. A computer-implemented method for decomposing a digital audio signal including mixed audio sources in a network, comprising the steps of: selecting a source audio signal data and a decomposition type at a client terminal; uploading the source audio signal data and data representative of the selected decomposition type to a queuing module; queuing the uploaded source audio data and distributing same to an audio data processing module from the queuing module; processing the distributed source audio data into decomposed audio signal data at the audio data processing module with a sequence of algorithms implementing non-negative matrix factorisations, wherein the sequence is determined by the type selection data; uploading the decomposed audio signal data to the queuing module; and queuing the uploaded decomposed data and distributing same to the client terminal from the queuing module.
 12. The computer-implemented method of claim 11, wherein the step of selecting a decomposition type comprises selecting at least one selected from a vocal audio source separation and a drums audio source separation.
 13. The computer-implemented method of claim 11, wherein the step of processing the distributed source audio data comprises separating at least a vocal audio source therefrom, with a first sequence of algorithms implementing non-negative matrix factorisations.
 14. The computer-implemented method of claim 11, wherein the step of processing the distributed source audio data comprises separating at least a drums audio source therefrom, with a second sequence of algorithms implementing non-negative matrix factorisations.
 15. A set of instructions recorded on a data carrying medium or stored at a network storage medium which, when read and processed by a data processing terminal connected to a network, configures the terminal to perform a computer-implemented method for decomposing a digital audio signal including mixed audio sources in a network, the method comprising the steps of: selecting a source audio signal data and a decomposition type at a client terminal; uploading the source audio signal data and data representative of the selected decomposition type to a queuing module; queuing the uploaded source audio data and distributing same to an audio data processing module from the queuing module; processing the distributed source audio data into decomposed audio signal data at the audio data processing module with a sequence of algorithms implementing non-negative matrix factorisations, wherein the sequence is determined by the type selection data; uploading the decomposed audio signal data to the queuing module; and queuing the uploaded decomposed data and distributing same to the client terminal from the queuing module.